docs: target Context7 benchmark gaps in Python skills [no-ci] #699

Merged
lukeocodes merged 2 commits into main from lo/ctx7-benchmark-lift-python on Apr 27, 2026

Conversation

@lukeocodes
Member

Summary

Closes the four largest gaps in the Context7 benchmark for /deepgram/deepgram-python-sdk. Current score: 88.8/100 (mean across 10 standardized prompts). The 4 weakest prompts account for ~97 of the 112 missing points; this PR addresses each one specifically.

What's broken (Context7 evaluator quotes)

| # | Prompt | Score | What's missing |
|---|--------|-------|----------------|
| 1 | Voice agent dynamic adjustment + stream restart/pause | 66 | "lacks specific guidance or API methods for dynamically adjusting transcription parameters during an active connection or for intelligently managing stream restarts and pauses beyond basic error events" |
| 2 | Live streaming with interim results display | 71 | "all examples show interim_results=False, which is the opposite of what's needed, and none demonstrate how to differentiate between interim and final results or how to handle the display logic" |
| 5 | Diarization + word-level timings combined | 83 | "lacks a specific, complete code example showing how to enable both diarization and word-level timings together in a single request" |
| 8 | Async URL transcription + retrieve final result | 83 | "lacks critical information about handling asynchronous results - while it mentions callback functionality, it doesn't explain how to retrieve the final transcription when using async methods or how to poll for results" |

Changes

deepgram-python-voice-agent/SKILL.md (+139 lines, prompt #1)

  • New "Dynamic mid-session adjustment" section — runnable code for every control message exposed by the Agent socket client:
    • send_update_prompt(AgentV1UpdatePrompt) — swap LLM system prompt mid-conversation
    • send_update_speak(AgentV1UpdateSpeak) — swap TTS voice
    • send_update_think(AgentV1UpdateThink) — swap LLM provider/model
    • send_inject_agent_message(...) — force agent to say something
    • send_inject_user_message(...) — inject user input
    • send_keep_alive(...) — idle keep-alive
    • Server reply event names noted for each (PromptUpdated, SpeakUpdated, ThinkUpdated, InjectionRefused)
    • Async equivalents
  • New "Stream lifecycle & recovery" section — KeepAlive loop on idle, pause/resume audio, reconnect after disconnect with conversation history replay via AgentV1SettingsAgentContext, EventType.CLOSE / EventType.ERROR handling
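The "KeepAlive loop on idle" idea can be sketched without the SDK. This is a minimal illustration, not the code added to the skill: `IdleKeepAlive`, the 5-second interval, and the injected `send_keep_alive` callable are all assumptions made for the example. In real use the callable would wrap whatever `send_keep_alive(...)` call the skill documents.

```python
import time
from typing import Callable


class IdleKeepAlive:
    """Hypothetical helper: fire a keep-alive when no audio has gone out recently."""

    def __init__(
        self,
        send_keep_alive: Callable[[], None],
        interval_s: float = 5.0,  # assumed idle threshold, not taken from the SDK
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send_keep_alive
        self._interval = interval_s
        self._clock = clock
        self._last_activity = clock()

    def mark_audio_sent(self) -> None:
        """Record that real audio was just sent, which resets the idle timer."""
        self._last_activity = self._clock()

    def tick(self) -> bool:
        """Send a keep-alive if idle for at least the interval. Returns True if sent."""
        now = self._clock()
        if now - self._last_activity >= self._interval:
            self._send()
            self._last_activity = now
            return True
        return False


# Demo with a fake clock and a list standing in for the socket.
fake_now = [0.0]
sent = []
keeper = IdleKeepAlive(lambda: sent.append("KeepAlive"), clock=lambda: fake_now[0])

fake_now[0] = 3.0
keeper.tick()  # idle for 3 s: nothing sent
fake_now[0] = 6.0
keeper.tick()  # idle for 6 s: keep-alive sent
```

Calling `tick()` from the same loop that feeds audio keeps the logic single-threaded; pausing audio then only means no longer calling `mark_audio_sent()`.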

deepgram-python-speech-to-text/SKILL.md (+103 lines, prompts #2 + #8)

Prompt #2:

  • Rewrote the WebSocket quick-start to pass interim_results=True, utterance_end_ms=1000, vad_events=True
  • Real overwrite-line display pattern showing interim results live and committing the line on final
  • New "Interim vs. final flag semantics" subsection on is_final, speech_final, from_finalize distinctions
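The overwrite-line display described for Prompt #2 could look roughly like the sketch below. It deliberately takes a plain transcript string and an `is_final` boolean rather than an SDK message object, since how those are read from the SDK's result type is not shown here; `make_renderer` and `render` are hypothetical names.

```python
import io
import sys
from typing import Callable, TextIO


def make_renderer(out: TextIO = sys.stdout) -> Callable[[str, bool], None]:
    """Return a callback that redraws interim text in place and commits finals."""
    state = {"last_len": 0}  # mutable closure state shared across calls

    def render(transcript: str, is_final: bool) -> None:
        if not transcript:
            return
        # Pad with spaces so a shorter update fully erases the previous interim.
        padding = " " * max(0, state["last_len"] - len(transcript))
        if is_final:
            # Commit the line: newline moves the cursor on, nothing left to erase.
            out.write("\r" + transcript + padding + "\n")
            state["last_len"] = 0
        else:
            # Interim: carriage return so the next update overwrites this one.
            out.write("\r" + transcript + padding)
            state["last_len"] = len(transcript)
        out.flush()

    return render


# Demo against an in-memory buffer instead of a terminal.
buf = io.StringIO()
render = make_renderer(buf)
render("hello wor", False)
render("hello", False)
render("hello world", True)
```

Keeping the length of the last interim in a dict lets the callback mutate it without `global` or `nonlocal`.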

Prompt #8:

  • New "Async / deferred result patterns" section explicitly distinguishing Python async/await (sync-style, immediate result via AsyncDeepgramClient) from deferred via callback URL (returns request_id immediately, results POST'd to webhook later — no polling)
  • Decision table mapping each pattern to when to use it
  • Pointer to examples/12-transcription-prerecorded-callback.py
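For the deferred (callback URL) pattern, the receiving side is where the final transcript arrives. The sketch below assumes the webhook body mirrors the standard prerecorded response JSON (`metadata.request_id` and `results.channels[].alternatives[].transcript`); that shape, the `pending` registry, and both function names are assumptions for illustration, not part of the SDK.

```python
from typing import Any, Dict, Optional

# Hypothetical in-memory registry: request_id -> transcript (None until delivered).
pending: Dict[str, Optional[str]] = {}


def register_request(request_id: str) -> None:
    """Remember a request_id returned immediately by the callback-style call."""
    pending[request_id] = None


def handle_callback(payload: Dict[str, Any]) -> Optional[str]:
    """Match a webhook POST body to a pending request and store its transcript."""
    request_id = payload.get("metadata", {}).get("request_id")
    if request_id not in pending:
        return None  # unknown or unexpected request: ignore it
    channels = payload.get("results", {}).get("channels", [])
    transcript = ""
    if channels and channels[0].get("alternatives"):
        transcript = channels[0]["alternatives"][0].get("transcript", "")
    pending[request_id] = transcript
    return transcript


# Demo with a made-up request_id and payload.
register_request("req-123")
sample_payload = {
    "metadata": {"request_id": "req-123"},
    "results": {"channels": [{"alternatives": [{"transcript": "hello world"}]}]},
}
delivered = handle_callback(sample_payload)
```

Nothing here polls: the request is registered when the call returns its `request_id`, and the result is filled in only when the webhook fires.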

deepgram-python-audio-intelligence/SKILL.md (+41 lines, prompt #5)

  • New "Quick start — diarization with word-level timings" section
  • One focused snippet: diarize=True, smart_format=True, punctuate=True + per-word iteration accessing speaker, start, end, confidence, punctuated_word
  • Per-word fields table
  • groupby-by-speaker utterance pattern + pointer to utterances=True / paragraphs=True for pre-grouped views
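The groupby-by-speaker pattern can be illustrated on plain dictionaries. The sample words below are invented, and the SDK may well return objects with attributes rather than dicts, so treat the field access as an assumption; only the field names (`punctuated_word`, `start`, `end`, `speaker`) come from the list above.

```python
from itertools import groupby
from typing import Any, Dict, List

# Invented sample data standing in for a diarized word list.
words: List[Dict[str, Any]] = [
    {"punctuated_word": "Hello,", "start": 0.0, "end": 0.4, "speaker": 0},
    {"punctuated_word": "there.", "start": 0.45, "end": 0.8, "speaker": 0},
    {"punctuated_word": "Hi!", "start": 1.1, "end": 1.3, "speaker": 1},
    {"punctuated_word": "Welcome", "start": 2.0, "end": 2.4, "speaker": 0},
    {"punctuated_word": "back.", "start": 2.45, "end": 2.8, "speaker": 0},
]


def speaker_turns(words: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collapse consecutive words from the same speaker into one turn."""
    turns = []
    # groupby only merges adjacent items, which is what a "turn" needs.
    for speaker, run in groupby(words, key=lambda w: w["speaker"]):
        run_words = list(run)
        turns.append(
            {
                "speaker": speaker,
                "start": run_words[0]["start"],
                "end": run_words[-1]["end"],
                "text": " ".join(w["punctuated_word"] for w in run_words),
            }
        )
    return turns


turns = speaker_turns(words)
for turn in turns:
    print(f"[{turn['start']:.2f}-{turn['end']:.2f}] speaker {turn['speaker']}: {turn['text']}")
```

Because `itertools.groupby` groups only adjacent items, speaker 0 appears as two separate turns here rather than being merged across speaker 1's interruption.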

Expected lift

If every gap closes:

  • Prompt 1: 66 → ~95 (+29)
  • Prompt 2: 71 → ~95 (+24)
  • Prompt 5: 83 → ~95 (+12)
  • Prompt 8: 83 → ~95 (+12)

Total potential: +77 / 1000 (across 10 prompts) = 88.8 → ~96.5 benchmark score.
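The lift arithmetic can be checked in a few lines. The scores and the ~95 target are the figures quoted in this description; the variable names are just for the example.

```python
# Current scores for the four weakest prompts, keyed by prompt number.
current = {1: 66, 2: 71, 5: 83, 8: 83}
target = 95        # approximate per-prompt score if each gap closes
total_now = 888    # 88.8 mean across 10 prompts, expressed as points out of 1000

lift = sum(target - score for score in current.values())          # points gained at ~95 each
missing_in_four = sum(100 - score for score in current.values())  # points these four fall short of 100
total_missing = 1000 - total_now                                  # points missing across all 10 prompts
projected_mean = (total_now + lift) / 10

print(lift, missing_in_four, total_missing, projected_mean)
```

The gap between the 97 missing points and the 77-point lift comes from targeting ~95 rather than 100 on each of the four prompts.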

After merge

Trigger Context7 refresh on /deepgram/deepgram-python-sdk to pull the new content into the index, then re-run the benchmark to verify the lift.

The Context7 benchmark for /deepgram/deepgram-python-sdk scores the SDK
against 10 standardized prompts (rubric: implementation 40 + accuracy 25 +
relevance 20 + completeness 10 + clarity 5 = 100). Current score: 88.8.
Four prompts had the largest gaps:

Prompt #1 (66/100) - Voice agent dynamic adjustment + stream restart
  Eval said the skill 'lacks specific guidance or API methods for
  dynamically adjusting transcription parameters during an active
  connection or for intelligently managing stream restarts and pauses
  beyond basic error events'.

  deepgram-python-voice-agent/SKILL.md:
  - New 'Dynamic mid-session adjustment' section with runnable code for
    send_update_prompt, send_update_speak, send_update_think,
    send_inject_agent_message, send_inject_user_message, send_keep_alive
    (sync + async equivalents).
  - New 'Stream lifecycle & recovery' section covering KeepAlive on idle,
    pause/resume audio, reconnect after disconnect with conversation
    history replay via AgentV1SettingsAgentContext, and EventType.CLOSE /
    EventType.ERROR handling guidance.

Prompt #2 (71/100) - Live streaming with interim results display
  Eval said 'all examples show interim_results=False, which is the
  opposite of what's needed, and none demonstrate how to differentiate
  between interim and final results or how to handle the display logic'.

  deepgram-python-speech-to-text/SKILL.md:
  - Rewrote the WebSocket quick-start to pass interim_results=True,
    utterance_end_ms=1000, vad_events=True, with a real overwrite-line
    pattern that shows interim results live and commits the line on final.
  - Added an 'Interim vs. final flag semantics' subsection explaining
    is_final, speech_final, and from_finalize distinctions and when each
    fires.

Prompt #5 (83/100) - Diarization + word-level timings combined
  Eval said the skill 'lacks a specific, complete code example showing
  how to enable both diarization and word-level timings together in a
  single request'.

  deepgram-python-audio-intelligence/SKILL.md:
  - New 'Quick start - diarization with word-level timings' section: one
    focused snippet enabling diarize=True with per-word iteration showing
    speaker, start, end, confidence, punctuated_word.
  - Added a per-word fields table (word, punctuated_word, start, end,
    confidence, speaker, speaker_confidence) plus a groupby-by-speaker
    pattern and pointers to utterances=True / paragraphs=True for
    pre-grouped views.

Prompt #8 (83/100) - Async URL transcription + retrieve final result
  Eval said the skill 'lacks critical information about handling
  asynchronous results - while it mentions callback functionality, it
  doesn't explain how to retrieve the final transcription when using
  async methods or how to poll for results'.

  deepgram-python-speech-to-text/SKILL.md:
  - New 'Async / deferred result patterns' section explicitly
    distinguishing Python async/await (sync-style, immediate result via
    AsyncDeepgramClient) from deferred via callback URL (returns
    request_id immediately, results POST'd to webhook later, no polling).
  - Decision table mapping each pattern to when to use it, with pointer
    to examples/12-transcription-prerecorded-callback.py.

Net: +276 lines targeting ~97 missing benchmark points (potential lift
88.8 -> ~98 once Context7 reindexes).
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the in-repo Context7 “skills” documentation for the Deepgram Python SDK to address several benchmark prompt gaps, primarily by adding more complete, runnable examples and clarifying behavioral semantics (interim vs final streaming, mid-session agent updates, diarization + word timings, and async patterns).

Changes:

  • Added mid-session voice agent control-message examples (prompt/think/speak updates, message injection, keep-alives) and reconnection/context replay guidance.
  • Reworked live WebSocket transcription quick-start to demonstrate interim_results=True with clear interim-vs-final display handling and clarified result flags.
  • Added a focused diarization + per-word timing quick-start and expanded async/deferred transcription guidance for prerecorded URL transcription.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| .agents/skills/deepgram-python-voice-agent/SKILL.md | Adds dynamic mid-session update examples and stream lifecycle/recovery guidance for Agent V1. |
| .agents/skills/deepgram-python-speech-to-text/SKILL.md | Updates live streaming quick-start for interim results and adds async/deferred result patterns + flag semantics. |
| .agents/skills/deepgram-python-audio-intelligence/SKILL.md | Adds a diarization + word-level timings quick-start and per-word field reference table. |

Comment thread .agents/skills/deepgram-python-speech-to-text/SKILL.md Outdated
Comment thread .agents/skills/deepgram-python-voice-agent/SKILL.md Outdated
… feedback)

Both Copilot threads on PR #699:

- deepgram-python-speech-to-text/SKILL.md interim-results snippet used
  `global last_interim_len` but the variable was defined in the enclosing
  `with` block, not at module scope. That would raise NameError on the
  first read. Replaced with a mutable closure (`state = {...}` dict),
  which is the idiomatic pattern when a callback needs to mutate state
  inside a `with` block.

- deepgram-python-voice-agent/SKILL.md said the server emits a 'History
  event (type agent_v1history)'. `agent_v1history` is the internal
  Python module/file name, not the wire `type` literal. The wire
  `type` is `"History"` and the Python class is `AgentV1History`.
  Reworded so readers don't pattern-match on the wrong identifier.
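
The first fix can be reproduced in isolation. This sketch assumes the original `with` block sat inside a function: a `with` statement does not create a scope of its own, so the failure comes from the variable being function-local while `global` looks at module scope. The function names are hypothetical; `nonlocal` would be another valid fix.

```python
def run_with_global() -> None:
    last_interim_len = 0  # local to run_with_global, not a module-level name

    def on_message(text: str) -> None:
        global last_interim_len  # refers to a module global that was never defined
        last_interim_len = max(last_interim_len, len(text))  # NameError on this read

    on_message("hi")


def run_with_closure() -> int:
    state = {"last_interim_len": 0}  # mutable object captured by the callback

    def on_message(text: str) -> None:
        # Mutating the dict needs no global/nonlocal declaration.
        state["last_interim_len"] = max(state["last_interim_len"], len(text))

    on_message("hi")
    return state["last_interim_len"]


try:
    run_with_global()
    error = ""
except NameError as exc:
    error = str(exc)

result = run_with_closure()
```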
@lukeocodes lukeocodes merged commit a232eb8 into main Apr 27, 2026
10 checks passed
@lukeocodes lukeocodes deleted the lo/ctx7-benchmark-lift-python branch April 27, 2026 16:14
GregHolmes pushed a commit that referenced this pull request May 6, 2026
🤖 I have created a release *beep* *boop*
---


## [7.1.0](v7.0.0...v7.1.0) (2026-05-06)


### Features

* update generated SDK models and restore agent settings compatibility ([#705](#705)) ([0b820c9](0b820c9))


### Documentation

* target Context7 benchmark gaps in Python skills [no-ci] ([#699](#699)) ([a232eb8](a232eb8))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

3 participants